Abstract

We introduce a factor analysis model that 
summarizes the dependencies between ob- 
served variable groups, instead of dependen- 
cies between individual variables as standard 
factor analysis does. A group may corre- 
spond to one view of the same set of objects, 
one of many data sets tied by co-occurrence, 
or a set of alternative variables collected from 
statistics tables to measure one property of 
interest. We show that by assuming group- 
wise sparse factors, active in a subset of the 
sets, the variation can be decomposed into 
factors explaining relationships between the 
sets and factors explaining away set-specific 
variation. We formulate the assumptions in 
a Bayesian model which provides the factors, 
and apply the model to two data analysis 
tasks, in neuroimaging and chemical systems 
biology. 



1 Introduction 

Factor analysis (FA) is one of the cornerstones of classical
data analysis. It explains a multivariate data set
$X \in \mathbb{R}^{N \times D}$ in terms of $K < D$ factors that
capture joint variability or dependencies between the N
observed samples of dimensionality D. Each factor
or component has a weight for each dimension, and
joint variation of different dimensions can be studied
by inspecting these weights, collected in the loading
matrix $W \in \mathbb{R}^{D \times K}$. Interpretation is made easier in
the sparse variants of FA (Knowles and Ghahramani 2011;
Paisley and Carin 2009; Rai and Daume III 2009), which favor
solutions with only a few non-zero weights for each factor.

We introduce a novel extension of factor analysis,
coined group factor analysis (GFA), for finding factors
that capture joint variability between data sets instead
of individual variables. Given a collection $X_1, \ldots, X_M$
of M data sets of dimensionalities $D_1, \ldots, D_M$, the task
is to find $K < \sum_{m=1}^{M} D_m$ factors that describe the
collection and in particular the dependencies between the
data sets or views $X_m$. Now every factor should provide
weights over the data sets, preferably again in a
sparse manner, to enable analyzing the factors in the
same way as in traditional FA.

The challenge in moving from FA to GFA is in how to 
make the factors focus on dependencies between the 
data sets. For regular FA it is sufficient to include a 
separate variance parameter for each dimension. Since 
the variation independent of all other dimensions can 
be modeled as noise, the factors will then model only 
the dependencies. For GFA that would not be suf- 
ficient, since the variation specific to a multi-variate 
data set can be more complex. To force the factors
to model only dependencies, GFA hence needs to explicitly
model the independent variation, or structured
noise, within each data set. We use linear factors or 
components for that as well, effectively using a princi- 
pal component analyzer (PCA) as a noise model within 
each data set. 

The solution to the GFA problem is described as a set 
of K factors that each contain a projection vector for 
each of the data sets having a non-zero weight for that 
factor. A fully non-sparse solution would hence have
$K \times M$ projection vectors or, equivalently, K projection
vectors over the $\sum_{m=1}^{M} D_m$-dimensional concatenation
of the data sources. That would, in fact, correspond
to regular FA of the feature-wise concatenated
data sources. The key in learning the GFA solution is
then in correctly fixing the sparsity structure, so that 
some of the components will start modeling the vari- 
ation specific to individual data sets while some focus 
on different kinds of dependencies. 

An efficient way of solving the Bayesian GFA problem
can be constructed by extending (sparse) Bayesian
Canonical Correlation Analysis (Archambeau and Bach 2009)
from two to multiple sets and by replacing
variable-wise sparsity by group-wise sparsity, as was
recently done by Virtanen et al. (2011). The model
builds on insights from these Bayesian CCA models
and recent non-Bayesian group sparsity works
(Jenatton et al. 2010; Jia et al. 2010). The resulting model
will operate on the concatenation $Y = [X_1, \ldots, X_M]$ of
the data sets, where the groups correspond to the data
sets. Then the factors in the GFA (projection vectors
over the dimensions of Y) become sparse in the sense
that the elements corresponding to some subset of the
data sets become zero, separately for each factor. The
critical question is to what extent the model is able to
extract the correct factors amongst the exponentially
many alternatives that are active in any given subset
of data sets. We empirically demonstrate that our
Bayesian model for group-wise sparse factor analysis
finds the true factors even from a fairly large number of
data sets.
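
As a small illustration of the concatenation described above, the following sketch stacks toy views feature-wise and keeps one column slice per view, so that the groups can be addressed later. The function name `concatenate_views` and the toy dimensions are assumptions made for this example only.

```python
import numpy as np

def concatenate_views(views):
    """Stack views feature-wise and return Y plus one column slice per view."""
    n_samples = views[0].shape[0]
    if any(v.shape[0] != n_samples for v in views):
        raise ValueError("all views must share the same N samples")
    dims = [v.shape[1] for v in views]
    starts = np.concatenate(([0], np.cumsum(dims)[:-1]))
    groups = [slice(int(s), int(s + d)) for s, d in zip(starts, dims)]
    return np.hstack(views), groups

rng = np.random.default_rng(0)
views = [rng.standard_normal((6, d)) for d in (3, 2, 4)]  # toy X_1, X_2, X_3
Y, groups = concatenate_views(views)                      # Y is 6 x 9
```

Indexing `Y[:, groups[m]]` recovers view m, which is the only bookkeeping a group-wise sparse model needs on top of ordinary FA.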

The main advantages of the model are that (i) it is con- 
ceptually very simple, essentially a regular Bayesian 
FA model with group-wise sparsity, and (ii) it enables 
tackling completely new kinds of data analysis prob- 
lems. In this paper we apply the model to two real- 
world example scenarios specifically requiring the GFA 
model, demonstrating how the GFA solutions can be 
interpreted. The model is additionally applicable to
various other tasks, such as learning of the subspace
of multi-view data predictive of some of the views, a
problem addressed by Chen et al. (2010).



The first application is analysis of fMRI measurements 
of brain activity. Encouraged by the recent success in 
discovering latent brain activity components in complex
data setups (Lashkari et al. 2010; Morup et al. 2010;
Varoquaux et al. 2010), we study a novel kind of
analysis setup where the same subject has been exposed
to several different representations of the same
musical piece. The brain activity measurements done 
under these different conditions are considered as dif- 
ferent views, and GFA reveals brain activity patterns 
shared by subsets of different conditions. For example, 
the model reveals "speech" activity shared by condi- 
tions where the subject listened to a recitation of the 
song lyrics instead of an actual musical performance. 

In the second application drug responses are studied
by modeling four data sets, three of which contain gene
expression measurements of responses of different cell
lines (proxies for three diseases; Lamb et al. 2006)
and one of which contains chemical descriptors of the drugs.
Joint analysis of these four data sets gives a handle on 
which drug descriptors are predictive of responses in a 
specific disease, for instance. 



2 Problem formulation 

The group factor analysis problem, introduced
in this work, is as follows: Given a collection
$X_1 \in \mathbb{R}^{N \times D_1}, \ldots, X_M \in \mathbb{R}^{N \times D_M}$ of data sets (or views)
with N co-occurring observations, find a set of K factors
that describe the joint data set $Y = [X_1, \ldots, X_M]$.
Each factor is a sparse binary vector $f_k \in \mathbb{R}^{M}$ over
the data sets, and the non-zero elements indicate that
the factor describes dependency between the corresponding
views. Furthermore, each active pair (factor k,
data set m) is associated with a weight vector
$w_{m,k} \in \mathbb{R}^{D_m}$ that describes how that dependency is
manifested in the corresponding data set m. The $w_{m,k}$
correspond to the factor loadings of regular FA, which
are now multivariate; their norm reveals the strength
of the factor and the vector itself gives a more detailed
picture of the nature of the dependency.

The $f_k$ have been introduced to make the problem
formulation and the interpretations simpler; in the specific
model we introduce next they will not be represented
explicitly. Instead, the weight vectors $w_{m,k}$ are
instantiated for all possible factor-view pairs and collected
into a single loading matrix W, which is then
made group-wise sparse.

3 Model 

We solve the GFA problem with a group-wise sparse
matrix factorization of Y, illustrated in Figure 1. The
variable groups correspond to the views 1,...,M and
the factorization is given by

$$Y \approx Z W^\top,$$

where we have assumed zero-mean data for simplicity.
The factorization gives a group-wise sparse weight
matrix $W \in \mathbb{R}^{D \times K}$, with $D = \sum_{m=1}^{M} D_m$, and the
latent components $Z \in \mathbb{R}^{N \times K}$. The weight matrix W
collects the factor- and view-specific projection vectors
$w_{m,k}$ for all pairs for which $f_{m,k} = 1$. The rest of the
elements in W are filled with zeroes.

We solve the problem in the Bayesian framework, pro- 
viding a generative model that extracts the correct 
factors by modeling explicitly the structured noise on 
top of the factors explaining dependencies. We assume 
the observation model

$$Y = Z W^\top + E,$$

where each row of the Gaussian noise E has a diagonal
noise covariance with the diagonal $[\sigma_1^2, \ldots, \sigma_M^2]$,
where each $\sigma_m^2$ has been repeated $D_m$ times. That is, every
dimension within the same view has the same residual 
variance, but the views may have different variances. 

Figure 1: Illustration of the group factor analysis of three data sets or views. The feature-wise concatenation
of the data sets is factorized as a product of the latent variables Z and factor loadings W. The factor
loadings are group-wise sparse, so that each factor is active (gray shading, indicating $f_{m,k} = 1$) only in some
subset of views (or all of them). The factors active in just one of the views model the structured noise, variation
independent of all other views, whereas the rest model the dependencies. The nature of each of the factors is
learned automatically, without needing to specify the numbers of different factor types (whose number would be
exponential in the number of views) beforehand.



To complete the generative formulation, we assume the 
rows of Z to have a zero-mean Gaussian distribution 
with unit covariance. That is, each factor is repre- 
sented with a Gaussian latent variable and all factors 
are a priori independent. 
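
The generative assumptions stated so far can be sketched as a short sampler. This is a minimal illustration, not the authors' code: the function name `sample_gfa_data`, the standard-normal loadings in active blocks, and the toy activity pattern are assumptions made for this example.

```python
import numpy as np

def sample_gfa_data(F, dims, n_samples, noise_std, rng):
    """Draw Y = Z W^T + E for a binary activity matrix F (M views x K factors)."""
    M, K = F.shape
    Z = rng.standard_normal((n_samples, K))          # rows ~ N(0, I_K)
    blocks = []
    for m in range(M):
        W_m = rng.standard_normal((dims[m], K))      # dense loadings for view m
        W_m *= F[m]                                  # zero out inactive factors
        blocks.append(W_m)
    W = np.vstack(blocks)                            # (sum of D_m) x K
    sigma = np.repeat(noise_std, dims)               # sigma_m repeated D_m times
    E = rng.standard_normal((n_samples, W.shape[0])) * sigma
    return Z @ W.T + E, Z, W

# Factor 0 is shared, factor 1 is specific to view 0, factor 2 to view 1.
F = np.array([[1, 1, 0],
              [1, 0, 1]])
dims = [4, 3]
Y, Z, W = sample_gfa_data(F, dims, 50, [0.1, 0.2], np.random.default_rng(1))
```

Factors 1 and 2 play the role of structured noise here, since each of them touches only one view.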

The weight matrix W is made sparse by assuming a
group-wise automatic relevance determination (ARD)
prior,

$$\alpha_{m,k} \sim \mathrm{Gamma}(a_0, b_0),$$

$$p(W) = \prod_{k=1}^{K} \prod_{m=1}^{M} \prod_{d=1}^{D_m} \mathcal{N}\left(w_{m,k}(d) \mid 0, \alpha_{m,k}^{-1}\right),$$

where $w_{m,k}(d)$ denotes the dth element in the projection
vector $w_{m,k}$, the vector corresponding to the mth
view and kth factor. The inverse variance of each vector
is governed by the parameter $\alpha_{m,k}$, which has a
Gamma prior with a small $a_0$ and $b_0$ (we used $10^{-14}$).
The ARD makes groups of variables inactive for specific
factors by driving their $\alpha_{m,k}^{-1}$ to zero. The components
used for modeling the structured noise within
each data set are automatically produced as factors
active in only one view.
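
To make the group-wise ARD prior concrete, the sketch below evaluates the Gaussian part of the log prior, with one precision per view-factor block. The function name `group_ard_log_prior` and the convention that `groups` holds row slices of W are assumptions of this example; the Gamma hyperprior is left out.

```python
import numpy as np

def group_ard_log_prior(W, alpha, groups):
    """log p(W | alpha) with one precision alpha[m, k] per view-factor block."""
    total = 0.0
    for m, rows in enumerate(groups):
        W_m = W[rows]                                 # D_m x K block of view m
        D_m = W_m.shape[0]
        sq_norms = np.sum(W_m ** 2, axis=0)           # ||w_{m,k}||^2 for each k
        total += np.sum(0.5 * D_m * (np.log(alpha[m]) - np.log(2 * np.pi))
                        - 0.5 * alpha[m] * sq_norms)
    return float(total)

W_toy = np.ones((3, 1))
value = group_ard_log_prior(W_toy, np.array([[2.0]]), [slice(0, 3)])
```

A large precision rewards an all-zero block and penalizes any non-zero block, which is how a whole view is switched off for a factor.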

Since the model is formulated through a sparsity prior
we do not explicitly need to represent F in the model.
It can, however, be created based on the factor-specific
relative contributions to the total variance of each
view, obtained by integrating out both z and W. We
set $f_{m,k} = 1$ if

$$\alpha_{m,k}^{-1} > \epsilon \left(\mathrm{Tr}(\Sigma_m) - \sigma_m^2\right) / D_m, \qquad (1)$$

where $\mathrm{Tr}(\Sigma_m)$ is the total variance of the mth view
and $\epsilon$ is a small threshold constant.
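
The thresholding rule (1) is simple to apply once the ARD precisions are available. The following sketch assumes the precisions are stored as an M x K array; the function name `activity_matrix` and the default `eps=1e-3` are illustrative choices, since the text only says that the constant is small.

```python
import numpy as np

def activity_matrix(alpha, total_var, noise_var, dims, eps=1e-3):
    """f[m, k] = 1 when 1/alpha[m, k] > eps * (Tr(Sigma_m) - sigma_m^2) / D_m."""
    alpha = np.asarray(alpha, dtype=float)
    threshold = eps * (np.asarray(total_var) - np.asarray(noise_var)) / np.asarray(dims)
    return (1.0 / alpha > threshold[:, None]).astype(int)

alpha = [[1.0, 1e8],     # view 0: factor 0 active, factor 1 switched off
         [1e8, 2.0]]     # view 1: factor 0 switched off, factor 1 active
F_hat = activity_matrix(alpha, total_var=[10.0, 8.0], noise_var=[1.0, 1.0], dims=[5, 4])
```

Here `total_var` stands for $\mathrm{Tr}(\Sigma_m)$ and `noise_var` for $\sigma_m^2$.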

The inference is based on a variational approximation.
We build on the approximation provided for
Bayesian FA by Luttinen and Ilin (2010), re-utilizing
the EM-style update rules for all of the parameters
of the model. The only differences are that the posterior
approximation for $\alpha$ needs to be updated for
each factor-view pair separately, the noise variances
$\sigma_m^2$ are view-specific instead of feature-specific, and
the parts of W corresponding to different views are
updated one at a time; the detailed update formulas are
not repeated here due to the close similarity.

To solve the difficult problem of fixing the rotation
in factor analysis models, we borrow ideas from the
recent solution by Virtanen et al. (2011) for CCA
models. Between each round of the EM updates we
maximize the variational lower bound with respect
to a linear transformation R of the latent subspace,
which is applied to both W and Z so that the product
$Z W^\top$ remains unchanged. That is, $\hat{Z} = Z R^{-\top}$ and
$\hat{W}^\top = R^\top W^\top$. Given the fixed likelihood, the optimal
R corresponds to a solution best matching the
prior that assumes independent latent components,
hence resulting in a posterior with maximally uncorrelated
components. We optimize for R by maximizing

$$L = -\frac{1}{2} \mathrm{Tr}\left(R^{-1} \langle Z^\top Z \rangle R^{-\top}\right) + C \log |R| - \sum_{m=1}^{M} \frac{D_m}{2} \sum_{k=1}^{K} \log\left(r_k^\top \langle W_m^\top W_m \rangle r_k\right) \qquad (2)$$

with the L-BFGS algorithm for unconstrained optimization.
Here $C = D - N$ with $D = \sum_m D_m$, $r_k$ is the kth
column of R, and $\langle Z^\top Z \rangle = \sum_n \langle z_n z_n^\top \rangle$ collects the
second moments of the factorization. Similar notation
is used for $W_m$, which indicates the part of W corresponding
to view m.
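
The rotation step can be illustrated with a sketch that applies a transformation R and evaluates the objective of Eq. (2). Two simplifications are assumed here: point estimates of Z and W stand in for the posterior second moments, and the objective is only evaluated, not optimized. The function names `rotate` and `rotation_objective` are hypothetical.

```python
import numpy as np

def rotate(Z, W, R):
    """Return Z R^{-T} and W R, which leaves the product Z W^T unchanged."""
    return Z @ np.linalg.inv(R).T, W @ R

def rotation_objective(R, ZZ, WW_blocks, dims, n_samples):
    """Evaluate Eq. (2) given <Z^T Z> and one <W_m^T W_m> per view."""
    R_inv = np.linalg.inv(R)
    C = sum(dims) - n_samples
    value = -0.5 * np.trace(R_inv @ ZZ @ R_inv.T)
    value += C * np.linalg.slogdet(R)[1]              # log of |det R|
    for D_m, WW_m in zip(dims, WW_blocks):
        quad = np.einsum("ik,ij,jk->k", R, WW_m, R)   # r_k^T <W_m^T W_m> r_k
        value -= 0.5 * D_m * np.sum(np.log(quad))
    return float(value)

rng = np.random.default_rng(2)
dims, N, K = [4, 3], 20, 3
Z = rng.standard_normal((N, K))
W = rng.standard_normal((sum(dims), K))
R = np.eye(K) + 0.1 * rng.standard_normal((K, K))     # some invertible transformation
Z_hat, W_hat = rotate(Z, W, R)
ZZ = Z.T @ Z                                          # point-estimate stand-in for <Z^T Z>
WW_blocks = [W[:4].T @ W[:4], W[4:].T @ W[4:]]
value = rotation_objective(R, ZZ, WW_blocks, dims, N)
```

In practice the negated objective could be handed to a quasi-Newton routine over the flattened entries of R; which optimizer interface is used is an implementation choice not fixed by the text beyond naming L-BFGS.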

3.1 Special cases and related problems 

When $D_m = 1$ for all m the problem reduces to regular
factor analysis. Then the $w_{m,k}$ are scalars and can be
incorporated into $f_k$ to reveal the factor loadings.

When M = 1, the problem reduces to probabilistic
principal component analysis (PCA), since all the factors
are active in the same view and they need to describe
all of the variation in the single-view data set
with linear components.

When M = 2, the problem becomes canonical correlation
analysis (CCA) as formulated by Archambeau
and Bach (2009) and Virtanen et al. (2011). This is
because then there are only three types of factors. Factors
active in both data sets correspond to the canonical
components, whereas factors active in only one of
the data sets describe the residual variation in each
view. Note that most multi-set extensions of CCA
applicable for M > 2 data sets, such as those by
Archambeau and Bach (2009) and Deleus and Hulle (2011),
do not solve the GFA problem. This is because they
do not consider components that could be active in
subsets of size I where $2 \le I < M$, but instead restrict
every component to be shared by all data sets
or to be specific to one of them.
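
The counting argument behind these special cases can be checked directly. The sketch below enumerates every non-empty subset of views in which a factor could be active; the helper name `factor_types` is an assumption of this example.

```python
from itertools import combinations

def factor_types(n_views):
    """List every non-empty subset of views a factor could be active in."""
    views = range(n_views)
    return [subset for size in range(1, n_views + 1)
            for subset in combinations(views, size)]

types_cca = factor_types(2)            # two view-specific types and one shared type
n_types_10 = len(factor_types(10))     # 2**10 - 1 possible types for ten views
partial_4 = [s for s in factor_types(4) if 2 <= len(s) < 4]  # shared by some but not all views
```

For M = 2 only three types exist, whereas for larger M the partially shared types in `partial_4` are exactly the ones that all-or-one multi-set CCA extensions leave out.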

A related problem formulation has been studied in 
statistics under the name multi-block data analysis. 
The goal there is to analyze connections between 
blocks of variables, but again the solutions typically
assume factors shared by all blocks (Hanafi and Kiers
2006). The model recently proposed by Tenenhaus
and Tenenhaus (2011) can find factors shared by only
a subset of blocks by studying correlations between
block pairs, but the approach requires specifying the
subsets in advance.



Recently Jia et al. (2010) proposed a multi-view learning
model that seeks components shared by any subset
of views, by searching for a sparse matrix factoriza- 
tion with convex optimization. However, they did not 
attempt to interpret the factors and only considered 
applications with at most three views. 



Knowles and Ghahramani (2007) suggested that regular
sparse FA (Knowles and Ghahramani 2011) could
be useful in a GFA-type setting. They applied sparse
FA to analyzing biological data with five tissue types
concatenated in one feature representation. In GFA
analysis the tissues would be considered as different
views, revealing automatically the sparse factor-view
associations that can only be obtained from sparse FA
after a separate post-processing stage. In the next section
we show that directly solving the GFA problem
outperforms the choice of thresholding sparse FA results.



4 Technical demonstration 

For technical validation of the model, we applied it to
simulated data that has all types of factors: factors
specific to just one view, factors shared by a small subset
of views, and factors common to most views. We
show that the proposed model can correctly discover
the structure already with limited data, while demonstrating
that possible alternative methods that could
be adapted to the scenario do not find the correct result.
We sampled M = 10 data sets with dimensionalities
$D_m$ ranging between 5 and 10 ($\sum_m D_m = 72$),
using a manually constructed set of K = 24 factors of
various types.

For comparing our Bayesian GFA model with alter- 
native methods that could potentially find the same 
structure, we consider the following constructs: 

• FA: Regular factor analysis for the concatenated 
data Y. The model assumes the correct number 
of factors, K = 24.

• BFA: FA with an ARD prior for columns of W, 
resulting in a Bayesian FA model that infers the 
number of factors automatically but assumes each 
factor to be shared by all views. 

• NSFA: Fully sparse factor analysis for Y. We use
the nonparametric sparse FA method by Knowles
and Ghahramani (2011), which has an Indian buffet
process formulation for inferring the number
of factors.

With the exception of the simple FA model, the alter- 
natives are comparable in the sense that they attempt 
to automatically infer the number of factors, which 
is a necessary prerequisite for modeling collections of 
several datasets, and that they are based on Bayesian 
inference. 

The solution for the GFA problem is correct if (i) the
model discovers the correct sparsity structure F
and (ii) the weights $w_{m,k}$ mapping the latent variables
into the observations are correct. Since the methods
provide solutions of a varying number of factors and
do not preserve the order of factors, we use a similarity
measure (Knowles and Ghahramani 2011) that
chooses an optimal re-ordering and sign for the factors
found by the models, and then measures the mean-square
error.
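
The idea of scoring a solution only after an optimal re-ordering and sign choice can be sketched as follows. This is a brute-force illustration under stated assumptions, not the exact measure of Knowles and Ghahramani: it requires both matrices to have the same number of columns (a smaller solution would first be padded with zero columns) and it enumerates all permutations, so it is only practical for a handful of factors. The name `matched_mse` is hypothetical.

```python
import numpy as np
from itertools import permutations

def matched_mse(W_true, W_est):
    """MSE after the best column permutation and per-column sign flip."""
    K = W_true.shape[1]
    if W_est.shape != W_true.shape:
        raise ValueError("pad the smaller matrix with zero columns first")
    # cost[i, j]: squared error of matching true column i to estimated column j,
    # taking the better of the two signs.
    diff_pos = ((W_true[:, :, None] - W_est[:, None, :]) ** 2).sum(axis=0)
    diff_neg = ((W_true[:, :, None] + W_est[:, None, :]) ** 2).sum(axis=0)
    cost = np.minimum(diff_pos, diff_neg)
    best = min(sum(cost[i, p[i]] for i in range(K)) for p in permutations(range(K)))
    return best / W_true.size

rng = np.random.default_rng(3)
W_true = rng.standard_normal((5, 3))
W_shuffled = W_true[:, [2, 0, 1]] * np.array([1.0, -1.0, 1.0])  # permuted, one sign flipped
mse_shuffled = matched_mse(W_true, W_shuffled)
mse_unrelated = matched_mse(W_true, rng.standard_normal((5, 3)))
```

A permuted and sign-flipped copy of the true loadings scores zero error, so the measure ignores exactly the indeterminacies that the models cannot resolve.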

We start by measuring property (ii), by inspecting the
similarity of the true and learned loading matrix W.
The GFA finds the mappings much more accurately
than the alternative methods (Fig. 2). Next we inspect
property (i), the similarity of the true and
estimated F, again using the same measure but for
binary matrices. Since FA and BFA do not enforce
any kind of sparsity within factors, we only compare
GFA and NSFA. For GFA we obtain F by thresholding
the ARD weights using (1), whereas for NSFA we set
$f_{m,k} = 1$ if any weight within $w_{m,k}$ is non-zero. GFA
again outperforms NSFA, which has not been designed
for the task. By adaptive thresholding of the weight
sets of NSFA it would be possible to reach almost the
accuracy of GFA in discovering F, but as shown by
the comparison of W the actual factors would still be
incorrect.

Figure 2: Top: Difference (mean square error, MSE)
between the estimated and true loading matrix W as a
function of the sample size N. Our Bayesian group factor
analyzer (GFA) directly solving the GFA problem
is consistently the best, beating state-of-the-art nonparametric
sparse factor analysis (NSFA), regular factor
analysis (FA) and Bayesian FA (BFA). It is worthwhile
to notice that increasing the sample size does
not help the alternative methods to obtain the correct
solution. Bottom: Difference between the estimated
and true factor activity matrix F, shown for the two
methods providing sparse representations. Again GFA
outperforms NSFA.
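
The rule used for NSFA in the comparison, where a view counts as active for a factor as soon as any of its weights is non-zero, can be written in a few lines. The function name `activity_from_sparse_weights` and the toy matrix are assumptions of this example.

```python
import numpy as np

def activity_from_sparse_weights(W, groups):
    """f[m, k] = 1 if any weight of factor k inside view m is non-zero."""
    return np.array([(W[rows] != 0).any(axis=0) for rows in groups], dtype=int)

W_sparse = np.array([[0.0, 1.0],
                     [0.0, 0.0],
                     [2.0, 0.0],
                     [0.0, 0.0],
                     [0.0, 0.0]])
groups = [slice(0, 2), slice(2, 5)]        # view 0 has two features, view 1 has three
F_nsfa = activity_from_sparse_weights(W_sparse, groups)
```

A single stray non-zero weight is enough to switch a whole block on under this rule, which is one reason adaptive thresholding would be needed to make element-wise sparse solutions competitive in recovering F.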

Next we proceed to inspect how well GFA can extract
the correct structure from different kinds of data collections,
with a particular focus on discovering whether
it is biased towards specific types of factors. If no such bias is
found, the experiments give empirical support that
the method solves the GFA problem in the right way.
For this purpose we constructed data sets with a fixed
number of samples (N = 100) and a varying number
of views M that are all D = 10 dimensional. We
created data collections with three alternative distributions
over the different kinds of factors, simulating
possible alternatives encountered in real applications.



The first distribution type has one factor of each possi- 
ble cardinality and the second shows a power-law dis- 
tribution with few factors active in many views and 
many very sparse factors. Finally, the third type has a 
uniform distribution over the subsets, resulting in the 
cardinality distribution following binomial coefficients. 
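
Two of the three activity distributions are specified well enough to sketch. The code below assumes that, for the first type, the views of each factor are chosen at random given its cardinality, and that, for the third type, empty factors are excluded by redrawing. The power-law case is omitted because its exponent is not stated. Both function names are hypothetical.

```python
import numpy as np

def one_factor_per_cardinality(n_views, rng):
    """One factor for each cardinality 1..M, active in a random subset of that size."""
    F = np.zeros((n_views, n_views), dtype=int)
    for k in range(n_views):
        active = rng.choice(n_views, size=k + 1, replace=False)
        F[active, k] = 1
    return F

def uniform_over_subsets(n_views, n_factors, rng):
    """Each view joins each factor with probability 1/2; empty factors are redrawn."""
    F = rng.integers(0, 2, size=(n_views, n_factors))
    empty = F.sum(axis=0) == 0
    while empty.any():
        F[:, empty] = rng.integers(0, 2, size=(n_views, int(empty.sum())))
        empty = F.sum(axis=0) == 0
    return F

rng = np.random.default_rng(4)
F_a = one_factor_per_cardinality(6, rng)     # cardinalities 1, 2, ..., 6
F_c = uniform_over_subsets(6, 20, rng)       # cardinalities roughly binomial
```

Drawing each membership independently with probability 1/2 makes every non-empty subset equally likely, which is why the resulting cardinalities follow the binomial coefficients.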

Figure 3 shows the true and estimated distribution of
factor cardinalities (the number of active views in a
factor) for the different data collections. For all three
cases, the model finds the correct structure for M = 40
views, and in particular models correctly both types of
factors: those shared by several views and those specific
to only one or a few. Besides checking for the correct
cardinalities, we inspected that the actual factors
found by the model match the true ones. We then proceeded
to demonstrate (Fig. 3(d)), for the case with
uniform distribution over factor cardinalities, that the
finding holds for all numbers of views below M = 60
for this case with just N = 100 samples; for other
distributions the results are similar (not shown).



5 Application scenarios and interpretation

Next, we apply the method to brain activity and drug 
response analysis, representative of potential use cases 
for GFA, and show how the results of the model can 
be interpreted. Both applications contain data of mul- 
tiple views (7 and 4, respectively) and could not be di- 
rectly analyzed with traditional methods. The number 
of views is well within the range for which the model 
was demonstrated above to find the correct structure 
from the simulated data. 

Both applications follow the same analysis procedure, 
which can be taken as our practical guidelines for 
data analysis with GFA. First, the model is learned
with sufficiently many factors, recognized as a solu- 
tion where more than a few factors are left at zero. 
Solutions where all the factors are in use cannot be re- 
lied upon, since then it is possible that a found factor 
describes several of the underlying true factors. After 
that, the factor activity matrix F, ordered suitably, is 
inspected in search for interesting broad-scale proper- 
ties of the data. In particular, the number of factors 
specific to individual views is indicative of the com- 
plexity of residual variation in the data, and factors 
sharing specific subsets of views can immediately re- 
veal interesting structure. Finally, individual factors 
are selected for closer inspection by ordering the fac- 
tors according to an interest measure specific to the 
application. We demonstrate that measures based on
both Z and W are meaningful. 

Figure 3: The GFA model finds the correct sparsity structure F for three very different distributions over
different types of factors in the data. The thick grey line shows the true latent structure as the cardinalities
(number of active views) of the factors in decreasing order, and the overlaid curves show results of 10 runs on
different simulated data. The first three plots show results for M = 40 views with three different distributions
of factor types ((a): uniform over the cardinality, (b): power law over the cardinality, (c): uniform over the view
combinations), and for all cases the model learns the correct structure. The last plot (d) shows that the behavior
is consistent over a wide range of M (the different curves) and only starts to break down when M approaches
the sample size N = 100.



5.1 Multi-set analysis of brain activity 

We analyze a data collection where the subjects have
been exposed to audiovisual music stimulation. The
setup is uniquely multi-view; each subject experienced 
the same three pieces of music seven times as different 
variations. For example, in one condition the subjects 
viewed a video recording of someone playing the piece 
on piano, in one condition they only heard a duet of 
singing and piano, and in one they saw and heard a 
person reading out the lyrics of the song. 

The M = 7 different exposure conditions or stimuli 
types were used as views, and we sought factors tying 
together the different conditions by applying GFA to a 
data set where the fMRI recordings of the 10 subjects 
were concatenated in the time direction. Each sample
(N = 1162) contained, for each view, activities of
Dm = 32 regions of interest (ROI) extracted as aver- 
ages of local brain regions. The set of 32 regions was 
chosen by first picking the five most correlating ROIs 
for each of the 21 possible pairs of listening conditions 
and then taking the union of these choices. 

The factor activity matrix F of factors times views, of
a 200-component solution, is shown in Figure 4. The
number of factors is sufficient, as demonstrated by the
roughly 40 empty factors, and we see that the data
contain large blocks of view-specific factors, suggesting
a lot of noise in the data. To more closely examine the
factors, we chose to study factors that are coherent
over the subjects. We split the latent variables $z_k \in \mathbb{R}^{N}$
corresponding to each factor into 10 sequences corresponding
to the samples of the 10 subjects, and measured
the inter-subject correlation (ISC) of each factor.
The factors were then ranked according to ISC (Fig. 4,
bottom), revealing a few components having very high
synchrony despite the model being ignorant that the
data consisted of several subjects.
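
One plausible reading of the ISC score is the mean pairwise correlation between the per-subject time courses of a factor. The sketch below assumes exactly that, and further assumes that the per-subject segments have already been cut out of $z_k$ and aligned to a common length; neither the exact ISC definition nor the segment boundaries are given in the text. The function name and the toy signals are illustrative.

```python
import numpy as np

def inter_subject_correlation(segments):
    """Mean pairwise Pearson correlation of one factor's per-subject time courses."""
    segments = np.asarray(segments, dtype=float)   # n_subjects x T, aligned in time
    corr = np.corrcoef(segments)                   # n_subjects x n_subjects
    upper = np.triu_indices(segments.shape[0], k=1)
    return float(corr[upper].mean())

rng = np.random.default_rng(5)
t = np.linspace(0, 8 * np.pi, 100)
shared = np.sin(t) + 0.1 * rng.standard_normal((10, 100))  # stimulus-locked factor
noise = rng.standard_normal((10, 100))                      # subject-specific factor
isc_shared = inter_subject_correlation(shared)
isc_noise = inter_subject_correlation(noise)
```

Ranking factors by such a score favors components that follow the stimulus in every subject over components that capture subject-specific variation.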

The strongest ISC is for a component
shared by all views. It captures the main progression
of the music pieces irrespective of the view. A
closer inspection of the weight vectors reveals that the
responses in the different views are in different brain
regions according to the modality; the four conditions
with purely auditory stimuli have weights revealing
auditory response, whereas the three conditions that also
include visual stimulation additionally activate vision-related regions.
The second-strongest ISC is for a component
shared by just two views, speech under both audiovisual
and purely auditory conditions. That is, the
component reveals a response to hearing recitation of 
the song lyrics instead of actual music as in the other 
conditions. 

The solution is consistent with how GFA is meant to 
work. In this specific application, it might be useful to 
additionally apply a model where solutions with similar
$w_{m,k}$ across all m with $f_{m,k} = 1$ would be favored.
The components would then be directly interpretable 
in terms of specific brain activity patterns, in addition 
to the time courses. 

5.2 Multi-set analysis of drug responses 

The second case study is a novel chemical systems biology
application, where the observations are drugs and
the first $M - 1$ views are responses of different cell
lines (each being a proxy to a different disease) to the
specific drug. The Mth view contains features describing
chemical properties of the drug, derived from its
structure. The interesting questions are whether GFA can find
factors relating drug structures and diseases, or relating
different cell lines, which here are different cancer
subtypes.

Figure 4: Multi-set analysis of brain activity. Top:
The matrix F of factors (rows) across views (columns),
indicating the dependencies between the views. The
different views correspond to different versions of the
same songs being played for the subject. Bottom:
Sorting the factors as a function of inter-subject correlation
reveals which factors robustly capture the response
to the stimulus.

As the biological views, we used activation profiles over
$D_1 = D_2 = D_3 = 1321$ gene sets in $M - 1 = 3$ cancer
cell lines compared to controls, from Lamb et al.
(2006). The 4th (chemical) view consisted of a standard
type of descriptors of the 3D structure of the drugs,
called VolSurf ($D_4 = 76$). The number of drugs for
which observations had been recorded for all cell lines
was N = 684.

The factor-view matrix F of a 600-component GFA
solution is shown in Figure 5 (top). The number of
components is large enough since there are over 100
empty factors. Four main types of factors were found:
(i) Factors shared by the chemical view and one (or
two) cell lines (zoomed inset in Fig. 5, top). They give
hypotheses for the specific cancer variants. (ii) Factors
shared by all cell lines and the chemical space, representing
effects common to all subtypes of cancer. (iii)
Factors shared by all cell lines but not the chemical
space. They are drug effects not captured by the specific
chemical descriptors used. The fact that there is
a large block of over 200 of these factors fits well with
the known fact that VolSurf features are a very limited
description of the complexity of drugs. (iv) Factors
specific to one biological view. These represent either
"biological noise" or drug effects specific to that
cancer subtype, again not captured by the VolSurf features.
Finally, the small set of components active only
in the chemical view corresponds to structure in VolSurf
having no drug effect.

Figure 5: Multi-set analysis of drug responses. Top:
Factor activity matrix of the factors (rows) against the
3 biological views (columns), cell lines HL60, MCF7,
PC3, and the chemical view of drug descriptors. The
small matrix at the bottom shows a zoomed inset to
an interesting subset of the factors. Bottom: Mean
average precision of retrieving drugs having similar effects
(targets), based on the first N GFA factors. Integration
of the data sources gives a significantly higher
retrieval performance than any data source separately.

We inspected some of the factors more carefully, and
more detailed biological analysis is on-going. The first
factor of type (i), shared by one cell line and the chemical
descriptors, activates genes linked to inflammatory
processes on the biological side, and is active in
particular for non-steroidal anti-inflammatory drugs
(NSAIDs), especially ibuprofen-like compounds, which
are known to have anti-cancer effects (Ruegg et al.
2003). Of the factors shared by all cell lines (type iii)
the one with the highest norm of the weight vectors
shows strong toxic effects on all cell lines, being linked
to stopping of cell growth and apoptosis. In summary,
these first findings are well in line with established
knowledge, and digging into the further components is
on-going.

We next validated quantitatively the ability of the 
model to discover biologically meaningful factors. We 
evaluated the performance of the found set of factors 
in representing drugs in the task of retrieving drugs 
known to have similar effects (having the same tar- 
gets). 

We represented each drug with the corresponding vec- 
tor in the latent variable matrix Z, and used corre- 
lation along the vectors as a similarity measure when 
retrieving the drugs most similar to a query drug. Re- 
trieval performance was measured as the mean aver- 
age precision of retrieving drugs having similar effects 
(having the same targets). As baselines we computed 
the distances in only the biological views or in only 
the chemical view. Representation by the GFA factors
significantly outperforms (t-test, p < 0.05) using
either space separately (Fig. 5, bottom). The experiment
was completely data driven except for one bit of
prior knowledge: As the chemical space is considered 
to be the most informative about drug similarity, the 
factors were pre-sorted by decreasing Euclidean norm 
of the weight vectors in the chemical space. 
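
The retrieval evaluation can be sketched as follows: drugs are compared by the correlation of their latent vectors, and average precision is computed over each ranked list. The sketch assumes a single target label per drug and skips queries that have no other drug with the same label; both are simplifications made for this example, and the function name is hypothetical.

```python
import numpy as np

def mean_average_precision(Z, targets):
    """MAP of correlation-based retrieval; drugs sharing a target label are relevant."""
    sim = np.corrcoef(Z)                     # drug-by-drug correlation of latent vectors
    np.fill_diagonal(sim, -np.inf)           # push the query itself to the end of its list
    targets = np.asarray(targets)
    ap_values = []
    for q in range(Z.shape[0]):
        order = np.argsort(-sim[q])[:-1]     # ranked list without the query
        relevant = targets[order] == targets[q]
        if not relevant.any():
            continue                         # skip queries with no relevant drug
        hits = np.cumsum(relevant)
        ranks = np.arange(1, len(order) + 1)
        ap_values.append(np.mean(hits[relevant] / ranks[relevant]))
    return float(np.mean(ap_values))

Z_toy = np.array([[1.0, 2.0, 3.0],
                  [2.0, 4.0, 6.1],
                  [3.0, 2.0, 1.0],
                  [6.0, 4.0, 2.1]])
map_toy = mean_average_precision(Z_toy, targets=[0, 0, 1, 1])
```

Restricting `Z` to its first few columns, after sorting the factors by the norm of their chemical-view weights, gives the kind of curve over the number of factors that the comparison reports.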

6 Discussion 

We introduced a novel problem formulation of finding 
factors describing dependencies between data sets or 
views, extending classical factor analysis which does 
the same for variables. The task is to provide a set of 
factors explaining dependencies between all possible 
subsets of the views. For solving the problem, coined
group factor analysis (GFA), we provided a group-wise
sparse Bayesian factor analysis model by extending a
recent CCA formulation by Virtanen et al. (2011) to
multiple views. The model was demonstrated to be
able to find factors of different types, including those 
specific to just one view and those shared by all views, 
equally well even for high numbers of views. We ap- 
plied the model to data analysis in new kinds of appli- 
cation setups in neuroimaging and chemical systems 
biology. 



The variational approximation used for solving the 
problem is computationally reasonably efficient and is 
directly applicable to data sets of thousands of sam- 
ples and several high-dimensional views, with the main 
computational cost coming from a high number of fac- 
tors slowing down the search for an optimal rotation 
of the factors. It would be fruitful to develop (approximate)
analytical solutions for optimizing Eq. (2), which is
necessary for the model to converge to the correct sparsity
structure; this would speed up the algorithm to the
level of standard Bayesian PCA/FA.

The primary challenge in solving the GFA problem is 
in correctly detecting the sparsity structure. Our so- 
lution was demonstrated to be very accurate at least 
for simulated data, but it would be fruitful to study 
how well the method fares in comparison with alternative
modeling frameworks that could be adapted to
solve the GFA problem, such as the structured sparse
matrix factorization by Jia et al. (2010) or extensions
of nonparametric sparse factor analysis (Knowles and
Ghahramani 2011) modified to support group sparsity.
It could also be useful to consider models that
are group-wise sparse but allow sparsity also within
the active factor-view groups or sparse deviations from
zero for the inactive ones, with model structures along
the lines Jalali et al. (2010) proposed for multi-task
learning.